This dataset was retrieved from a Brazilian government website (http://dados.gov.br/dataset/chegada-turistas, in Portuguese) and contains the number of foreign entries into Brazilian territory, divided by State. Data from 2005 to 2015 was merged from 10 different CSV files. There are 8 variables: Country of Origin, Continent, State of arrival, Access method, Year, Month as text, Month as number, and number of visitors in that month.
Unfortunately, all names used are in Portuguese, which should be fine for Country names but is not that intuitive for the access method: ‘aérea’ means ‘by air’, ‘fluvial’ means ‘by river’, ‘marítima’ means ‘by sea’ and ‘terrestre’ means ‘by land’ (by car, bus, on foot, etc.). Month names will be avoided, and Month numbers will be used instead.
Before dealing with the data, I made some initial hypotheses: the majority of visitors to internationally famous cities such as Rio de Janeiro will come from developed countries during Carnival (which happens in February), and big commercial cities such as Sao Paulo will have the largest number of foreign visitors, well distributed over the year.
Some internal codes were removed from this dataset; they are probably used by government databases and have no value for our study.
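As a minimal sketch of how several yearly CSV files can be stacked into one data frame, the base R snippet below creates two tiny stand-in files and merges them. The file names, folder and column names are hypothetical, not the ones used on dados.gov.br.

```r
# Hypothetical folder and file names; the real files may be named differently.
tmp_dir <- tempfile("chegadas_")
dir.create(tmp_dir)

# Two tiny stand-in files so the sketch runs on its own.
write.csv(data.frame(Country = "Argentina", Year = 2005, Arrivals = 10),
          file.path(tmp_dir, "chegadas_2005.csv"), row.names = FALSE)
write.csv(data.frame(Country = "Alemanha", Year = 2006, Arrivals = 5),
          file.path(tmp_dir, "chegadas_2006.csv"), row.names = FALSE)

# Read every CSV in the folder and stack the rows into one data frame.
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
arrivals <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
```

This only works when all files share the same column names. Real government exports may also need a different separator or a `fileEncoding` argument, which is an assumption worth checking against the actual files.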
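If English labels are preferred for the access method, a named lookup vector is one simple option. The sketch below uses the translations given above; the `access` values are illustrative and not taken from the real data.

```r
# Lookup table built from the translations given in the text.
access_en <- c("Aérea" = "Air", "Fluvial" = "River",
               "Marítima" = "Sea", "Terrestre" = "Land")

# Illustrative values only.
access <- c("Aérea", "Terrestre", "Marítima", "Aérea")

# Index the lookup by the Portuguese label to get the English one.
access_translated <- unname(access_en[access])
```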
## Continent Country
## Europa :93468 Outros países: 22620
## América do Sul :54288 África do Sul: 4524
## Ásia :28680 Alemanha : 4524
## África :22620 Angola : 4524
## América Central e Caribe:21852 Argentina : 4524
## América do Norte :13572 Austrália : 4524
## (Other) :17328 (Other) :206568
## State Access Year
## Outras Unidades da Federação: 28704 Aérea :102912 Min. :2005
## Rio Grande do Sul : 28032 Fluvial : 32736 1st Qu.:2007
## Paraná : 24672 Marítima : 68736 Median :2010
## Santa Catarina : 22032 Terrestre: 47424 Mean :2010
## Amazonas : 14688 3rd Qu.:2013
## Bahia : 14688 Max. :2015
## (Other) :118992
## Month Month.Number Arrivals
## abril : 20984 Min. : 1.00 Min. : 0.0
## agosto : 20984 1st Qu.: 3.75 1st Qu.: 0.0
## dezembro : 20984 Median : 6.50 Median : 0.0
## fevereiro: 20984 Mean : 6.50 Mean : 240.4
## janeiro : 20984 3rd Qu.: 9.25 3rd Qu.: 11.0
## julho : 20984 Max. :12.00 Max. :353122.0
## (Other) :125904 NA's :1920
The present dataset has 251,808 entries in total, but the only numeric variables are the number of arrivals and month of the year.
Let’s start with the count of records whose number of arrivals is different from zero, per continent. For some reason there is a category ‘Continente não especificado’ (unspecified continent), which makes no sense, so it will be removed. Please note the graph had to be flipped to better accommodate the bar labels.
## Warning: Ignoring unknown parameters: binwidth, bins, pad
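The count behind this graph can be sketched in base R as follows. The data frame is a toy stand-in that reuses the column names from the summary shown earlier; the values are invented.

```r
# Toy data; column names follow the summary output, values are invented.
df <- data.frame(
  Continent = c("Europa", "Europa", "Ásia",
                "Continente não especificado", "América do Sul"),
  Arrivals  = c(12, 0, 3, 7, 0),
  stringsAsFactors = FALSE
)

# Keep rows with at least one arrival and a known continent.
keep <- df$Arrivals > 0 & df$Continent != "Continente não especificado"

# Number of remaining records per continent.
counts <- table(df$Continent[keep])
```

A call such as `barplot(sort(counts), horiz = TRUE, las = 1)` would then draw the flipped bars. If the plot is drawn with ggplot2 instead, this particular warning is typically what `geom_histogram()` emits when given `stat = "count"`; `geom_bar()` counts rows without it.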
Looking at the graph, it seems the majority come from Europe, but since our data is also divided by Country and Europe has lots of small countries, more entries do not necessarily mean more visitors. The following graph is a look into those ‘unspecified continent’ entries. They seem to have arrived mostly in the states of Sao Paulo and Rio de Janeiro. We will look into it later, but this is the first mystery our dataset raised.
After checking which State those “unspecified continent” entries headed to, there is another weird thing about this dataset: some states are merged into one group called “Other States” (‘Outras Unidades da Federação’). Let’s check how many states we have in total:
There are 17 states out of the 26. The missing states are likely merged into this “Other States” category. Visually, this group is the third largest in number of non-zero entries, so it makes absolutely no sense to keep it in the database; it should be split into the missing states. That is a good example of the Brazilian public service efficiency we are all proud of.
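Counting the distinct states while setting the catch-all group aside could look like the sketch below. The `state` vector is a toy stand-in for the State column.

```r
# Toy stand-in for the State column.
state <- c("Paraná", "Bahia", "Outras Unidades da Federação", "Paraná")

# Distinct labels, excluding the catch-all group.
named_states <- setdiff(unique(state), "Outras Unidades da Federação")
n_states <- length(named_states)
```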
We can also check which access method has more entries.
By far the vast majority of entries are ‘by air’, which makes sense since Brazil is far from Europe and North America and is a continent-sized country. It would not be a surprise if the ‘by land’ entries correspond to the South American entries from the Continent graph.
Next there is the count of entries per year, removing entries with zero arrivals.
There are three recognizable peaks in the graph, in 2006, 2008 and 2015. Interestingly, the year of the 2014 soccer World Cup is not a peak, although its count is large.
As mentioned before, we would expect to see some peaks during the southern hemisphere summer due to tourism.
As expected, the difference for the period of November to February can be seen clearly in our last graph.
For our last analysis, we look at the number of arrivals, first overall and then for all values different from zero.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 240.4 11.0 353122.0 1920
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 3 14 509 114 353122
With a mean of 240 and a maximum of 353,122, it seems likely that some very large values are pulling the mean up, especially because the 3rd quartile is only 11. However, there are 1,920 NA values where there should be none. After further investigation, it seems the file for 2007 had this problem, as shown below.
## Continent Country
## América Central e Caribe:768 Cuba :384
## Europa :768 Guatemala :384
## Ásia :384 Índia :384
## África : 0 República Tcheca:384
## América do Norte : 0 Rússia :384
## América do Sul : 0 África do Sul : 0
## (Other) : 0 (Other) : 0
## State Access Year
## Outras Unidades da Federação:240 Aérea :720 Min. :2007
## Paraná :240 Fluvial :240 1st Qu.:2007
## Rio Grande do Sul :240 Marítima :600 Median :2007
## Santa Catarina :180 Terrestre:360 Mean :2007
## Amazonas :120 3rd Qu.:2007
## Bahia :120 Max. :2007
## (Other) :780
## Month Month.Number Arrivals
## abril :160 1 :160 Min. : NA
## agosto :160 2 :160 1st Qu.: NA
## dezembro :160 3 :160 Median : NA
## fevereiro:160 4 :160 Mean :NaN
## janeiro :160 5 :160 3rd Qu.: NA
## julho :160 6 :160 Max. : NA
## (Other) :960 (Other):960 NA's :1920
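One way to find which year the missing values belong to is to count NAs per year. This is a sketch on invented data, not the exact check used for the output above.

```r
# Invented data with missing arrival counts in a single year.
df <- data.frame(Year     = c(2006, 2007, 2007, 2008),
                 Arrivals = c(4, NA, NA, 0))

# Number of missing arrival counts in each year.
na_by_year <- tapply(is.na(df$Arrivals), df$Year, sum)
```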
In order to work with a more reasonable dataset, we will treat all of those NA values as zeroes rather than as errors from the government system.
After performing this fix, the number of NAs went to zero, as shown below.
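Replacing missing values with zero is a one-line operation in base R. The vector below is invented for illustration.

```r
# Invented arrival counts with two missing values.
arrivals <- c(15, NA, 0, NA, 230)

# Treat every missing count as zero arrivals.
arrivals[is.na(arrivals)] <- 0

# Number of NAs left after the fix.
remaining_na <- sum(is.na(arrivals))
```

Note that this choice lowers the mean compared with simply ignoring the missing rows (for example with `mean(x, na.rm = TRUE)`), so it is a modelling decision rather than a neutral clean-up.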
## Continent Country
## África :0 África do Sul :0
## América Central e Caribe :0 Alemanha :0
## América do Norte :0 Angola :0
## América do Sul :0 Arábia Saudita:0
## Ásia :0 Argentina :0
## Continente não especificado:0 Austrália :0
## (Other) :0 (Other) :0
## State Access Year
## Amazonas :0 Aérea :0 Min. : NA
## Bahia :0 Fluvial :0 1st Qu.: NA
## Ceará :0 Marítima :0 Median : NA
## Mato Grosso do Sul :0 Terrestre:0 Mean :NaN
## Outras Unidades da Federação:0 3rd Qu.: NA
## Pará :0 Max. : NA
## (Other) :0
## Month Month.Number Arrivals
## abril :0 1 :0 Min. : NA
## agosto :0 2 :0 1st Qu.: NA
## dezembro :0 3 :0 Median : NA
## fevereiro:0 4 :0 Mean :NaN
## janeiro :0 5 :0 3rd Qu.: NA
## julho :0 6 :0 Max. : NA
## (Other) :0 (Other):0
This dataset consists of 251,808 entries containing 8 variables, of which only 7 will actually be used. It describes the foreign visitors who crossed the Brazilian border, providing the number of arrivals in one month, the month number, the year, the country of origin, the continent of the country of origin, the state in which they arrived and the method of access (by air, land, river or sea).
After analyzing the distributions, one thing drew my attention: the number of arrivals from an “unknown country or continent”. Before digging deeper into it, it seems weird that a government database would have such entries, especially given the need for illegal immigration control.
The main interest of this Dataset is to understand how foreign visitors enter Brazil. The variable that will be most used in every graph and analysis is the total arrivals, which will be evaluated monthly, yearly, and in all possible situations.
The region of the state might help show where the main flow of foreigners is. Also, the month can help visualize seasonal events like Carnival in some States. It might be possible to correlate visitor flow through the years with global events, such as the 2008 economic crisis in the USA and the world.
Not yet, but it might be necessary if I want to plot the months in sequence year after year.
It was not necessary yet.
The first thing we can see is how the variable of our interest (number of arrivals) behaves through the years.
It is observable that there has been a slight increase in the number of foreign visitors since 2013. It is also possible to see how the accumulated visits are distributed through the continents on the graph below.
From this graph we can see that this increase is mainly due to South America. For the other two continents with the most visitors, Europe and North America, there has been a slight decrease in arrivals.
Below we have the graph of arrivals per continent in this 10-year window. South America and Europe have by far the largest numbers of visitors.
Next we observe how the sum of arrivals is distributed through the months of the year. As expected, there is a larger number of arrivals during summer in Brazil. However, there is an unexplained increase in July. So far, one possible explanation is the summer holidays in the northern hemisphere.
Still trying to understand the seasonal behaviour of international visitors, a new data frame was created with the sum of visitors for each month of each year, and a boxplot was drawn from it. As seen below, January is by far the month with the most visitors. The single dot in June with over a million visitors happened in 2014, precisely during the soccer World Cup.
The next plot shows how those international visitors got here. It is reasonable to suppose that the large majority of the land entries are from South American countries. Also, visitors who entered the country by air add up to more than twice all other categories combined.
The next plot shows the State in which visitors arrive. It is important to note that this is not necessarily where they are headed: Sao Paulo often has the cheapest fares and the largest capacity for international flights, so many visitors arrive there and then take a domestic flight to another State.
In order to analyse each country, I divided the graphs by continent to keep them from getting messy. In the next graphs it is evident how the idea of using an “other countries” category is wrong and wastes a lot of what could be useful information. In Africa, Asia and Central America, this category has very large values. Although together they add up to only around 1.2 million visitors, roughly 2% of the nearly 50 million visitors over the 10 years covered, they play an important role when studying each continent individually.
The number of arrivals is mainly composed of three continents: South America, Europe and North America. The vast majority arrived in the country by air; the second most common method was by land. The State that received the most visitors was Sao Paulo. We observed some seasonality in the monthly distribution of visitors, and the total number of visitors has been increasing over the last years (since 2010).
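The grouping step described here can be sketched with `aggregate()`. The data frame below is invented and only reuses the column names from the summary output.

```r
# Invented rows; several records can share the same year and month.
df <- data.frame(
  Year         = c(2013, 2013, 2013, 2014, 2014, 2014),
  Month.Number = c(1, 1, 6, 1, 6, 6),
  Arrivals     = c(100, 50, 20, 120, 300, 400)
)

# One row per (year, month) pair with the total arrivals in that month.
monthly <- aggregate(Arrivals ~ Year + Month.Number, data = df, FUN = sum)
```

With the real data, `boxplot(Arrivals ~ Month.Number, data = monthly)` would then show one box per month, with each year contributing one point to it.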
Since the only numeric value is the number of arrivals, it is not yet possible to make this association. We will explore it in the next section, “Multivariate Analysis”.
The largest number of visitors comes from Argentina, with around 15 million visitors over those years. Another interesting fact is the amount of visitors arriving by air. Finally, the distribution of visitors over the months was also very interesting.
In order to better understand the big picture, we analyse the distribution of total visitors per continent and per access method. The first impression is that visitors enter Brazil by air more than twice as often as by land, but if we look at continents individually, South America has almost the same number of visitors for both access methods. One interesting point in this graph is the large number of Europeans entering Brazil by land.
We will further study the two continents with the most visitors: Europe and South America. The next graph shows which countries in South America visit Brazil the most. Our “hermanos” from Argentina won this one.
The next graph attempts to relate countries from South America to the Brazilian State they used to enter Brazil, per year, considering only land entries. The scale had to be enlarged to make some details visible, and arrivals equal to zero have been removed to make the graph cleaner. As expected, visitors arrive in the state closest to the border between the countries. Because the difference in numbers between countries was too big, I applied the sqrt function to the number of arrivals so that the smaller values are also visible. Here we have some interesting facts: Venezuela, Peru, Bolivia and Guiana changed the State through which they enter Brazil since 2014. Also, some Argentinians are entering Brazil through Roraima, a State on the northern border. Why would they travel that far?
Applying the same criteria to visitors who arrived by air, we notice that the large majority entered Brazil through São Paulo and Rio de Janeiro. Also, it seems that international flights to Roraima started only in 2014, since there are no records before that. Visitors from Guiana and Guiana Francesa dropped drastically around 2009. Overall, there is an increase in visitors by air.
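A cross-tabulation of total arrivals by continent and access method is one way to get the numbers behind such a graph. The sketch uses invented totals.

```r
# Invented totals per continent and access method.
df <- data.frame(
  Continent = c("Europa", "Europa", "América do Sul", "América do Sul"),
  Access    = c("Aérea", "Terrestre", "Aérea", "Terrestre"),
  Arrivals  = c(900, 100, 500, 480)
)

# Sum of arrivals for every continent / access method combination.
totals <- xtabs(Arrivals ~ Continent + Access, data = df)

# Share of each access method within a continent (each row sums to 1).
shares <- prop.table(totals, margin = 1)
```

Looking at row shares rather than raw totals makes it easier to see that a continent can split almost evenly between air and land even when air dominates overall.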
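To illustrate why a square root helps here, the sketch below compares the spread of a few invented arrival counts before and after the transformation.

```r
# Invented arrival counts spanning several orders of magnitude.
arrivals <- c(4, 100, 10000, 1000000)

# Square-root transform compresses the large values.
sqrt_arrivals <- sqrt(arrivals)

# Ratio between the largest and smallest value, before and after.
spread_before <- max(arrivals) / min(arrivals)
spread_after  <- max(sqrt_arrivals) / min(sqrt_arrivals)
```

The ordering of the values is preserved, but the ratio between the extremes shrinks from 250000 to 500, so small flows remain visible next to large ones.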
The next two graphs are an attempt to repeat the previous analysis for European visitors. Europe has more countries, so it would be really messy to create a large list of graphs. The idea in the next graphs is that each state-country pair has a dot. This dot is composed of many rings: the outer ring is the year 2005 and the inner dot represents 2015. The color of each ring follows a gradient representing the number of visitors, where red means the most and blue the least. So red dots mean lots of visitors, purple dots an average number and blue dots few visitors. If the inside of the circle is redder, the number of visitors is increasing; if the outer ring is redder, it is decreasing.
The following graphs in this analysis use real maps to plot the flow of visitors. This is something I have wanted to learn for a long time, and now seems the perfect opportunity. First we have the land entries of South American visitors, drawn as a connection from the country they came from to the state they entered. This graph took a really long time and a lot of effort to produce, but it was really satisfying because it shows a lot of things happening in Latin America over the last 10 years. The thickness of each line segment is proportional to the number of arrivals. The transparency is set to 0.5, and the color of each line varies with the year, from red to blue. So if arrivals are about the same through the 10 years, we should see a purple line, but if they changed we see a contour in the color of the year with the largest value. From this graph we learn that our neighbours from Argentina have been entering by land more recently, through Rio Grande do Sul and Paraná, whereas our neighbours from Uruguay seem to have used this access more often in the past, judging by the bluish colour. It is important to remember that this does not mean the state is their final destination, just their point of entry into the country.
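The ring encoding can be sketched as two vectors per state-country pair: a point size that shrinks with the year and a color taken from a blue-to-red ramp. This is only an illustration with invented arrival counts, not the code that produced the graphs.

```r
years    <- 2005:2015
arrivals <- c(10, 12, 15, 20, 30, 45, 60, 80, 100, 130, 160)  # invented

# Older years get larger points, so 2005 is the outer ring and 2015 the centre.
ring_size <- rev(seq_along(years))

# Scale arrivals to the 0..1 range and map them onto a blue-to-red ramp.
scaled   <- (arrivals - min(arrivals)) / (max(arrivals) - min(arrivals))
ramp     <- colorRamp(c("blue", "red"))
ring_col <- rgb(ramp(scaled), maxColorValue = 255)
```

Drawing the points in this order, for example with `points(rep(1, 11), rep(1, 11), pch = 16, cex = ring_size, col = ring_col)` on an existing plot, stacks the smaller, more recent years on top of the older ones.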
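The visual encoding described above (one semi-transparent color per year, line width proportional to arrivals) can be sketched in base R. The direction of the ramp, blue for the earliest year and red for the latest, is inferred from the remark about Uruguay; the arrival counts and `max_width` are invented.

```r
years <- 2005:2015

# One color per year: blue for 2005 through red for 2015 (inferred direction).
year_col <- colorRampPalette(c("blue", "red"))(length(years))

# Transparency of 0.5, so overlapping red and blue lines blend into purple.
year_col <- adjustcolor(year_col, alpha.f = 0.5)
names(year_col) <- years

# Invented arrivals for one country-state pair, one value per year.
arrivals <- c(50, 55, 60, 70, 90, 120, 150, 200, 260, 330, 400)

# Line width proportional to arrivals, with the largest flow drawn at max_width.
max_width  <- 8
line_width <- max_width * arrivals / max(arrivals)
```

Each year's segment would then be drawn with its own `col` and `lwd`, for example via `segments()` in base graphics.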
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Bolivia&zoom=4&size=640x640&scale=2&maptype=roadmap&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Bolivia&sensor=false
Finally, last but not least, we have an image showing the accumulated arrivals over 10 years per state, separated by continent. It is interesting to see how the different regions of Brazil indeed receive different kinds of visitors. The North and Northeast are more touristic regions, with sunny tropical beaches, and indeed attract more European visitors, even in absolute numbers. The South seems to receive more South Americans, probably from Argentina, from what we have seen so far.
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Brasil&zoom=4&size=640x640&scale=2&maptype=roadmap&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Brasil&sensor=false
The next thing to be analysed is the seasonality of international visitors. First we verify how each continent contributed over the 10-year interval.
There are clearly three main players in foreign visitors to Brazil: North America, Europe and South America. However, we wish to learn about seasonality, therefore we must look at the distribution of visits for each month across the years. This can be seen in the graph below.
The outlier in June for every continent happened during the 2014 World Cup. Besides that, we can observe some interesting points: Asia and North America appear to have more evenly distributed visitors, which suggests that the visits are more professional in nature (not tourism). Europe and South America, on the other hand, seem to have more visitors during summer.
There is a noticeable, huge difference between South American arrivals and the rest of the world. For this reason, we can skip a few steps in this case and use information from previous graphs about which states receive the most foreign visitors from South America by land to make a smaller list.
In addition to the map that relates States and Countries, this provides an interesting insight about land access to Brazil: it happens mainly through the States of Paraná and Rio Grande do Sul and mostly for tourism (it is seasonal).
As the last graph, a model is proposed to quantify which states are the most “internationalized” and the most touristic. First, the absolute number of arrivals is no longer used; it is replaced by the percentage of the State population it represents. This metric seems to make sense, since a smaller State that receives lots of international visitors would stand out, whereas a large State whose number of people does not increase much during the holidays would not be that noticeable. To do this, another data source was needed, the population of each State, but unfortunately it is measured only every 10 years by our census. It is available on:
To make visualization easier, the 12 months were grouped into two main categories: warmer months and colder months. The idea behind this is the following: if we look at how temperatures are distributed annually in a colder southern state such as Santa Catarina, for instance, we notice that around November the weather is already nice enough to go to the beach, and it stays that way until around March.
Here is some data from Wikipedia for my city, Balneário Camboriú, as an example:
| Month | Average high °C (°F) | Daily mean °C (°F) | Average low °C (°F) |
|---|---|---|---|
| Jan | 29.0 (84.2) | 24.2 (75.6) | 19.8 (67.6) |
| Feb | 28.8 (83.8) | 24.1 (75.4) | 19.6 (67.3) |
| Mar | 28.3 (82.9) | 23.5 (74.3) | 18.7 (65.7) |
| Apr | 25.8 (78.4) | 20.8 (69.4) | 15.8 (60.4) |
| May | 23.8 (74.8) | 18.4 (65.1) | 13.1 (55.6) |
| Jun | 22.1 (71.8) | 16.7 (62.1) | 11.3 (52.3) |
| Jul | 21.3 (70.3) | 15.8 (60.4) | 10.4 (50.7) |
| Aug | 21.5 (70.7) | 16.5 (61.7) | 11.6 (52.9) |
| Sep | 22.1 (71.8) | 17.8 (64.0) | 13.6 (56.5) |
| Oct | 23.7 (74.7) | 19.5 (67.1) | 15.3 (59.5) |
| Nov | 25.4 (77.7) | 21.0 (69.8) | 16.6 (61.9) |
| Dec | 27.3 (81.1) | 22.7 (72.9) | 18.2 (64.8) |
So this last graph was built by plotting these percentages of each State population, per month, with each month categorized as either hot or cold according to the criteria described previously, using a very low alpha value. That way, if a similar number of occurrences happens in both the hot and cold categories, we should see a purple color. Otherwise, we should see a tendency towards blue or red. Also, the size of each point is proportional to the arrivals as a percentage of the state population, hence bigger points mean bigger percentages.
So, according to this model, the most international State is Rio Grande do Sul (and it is important to remember that this does not mean those visitors stayed in this state; maybe they were just passing through).
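The two derived quantities behind this graph, arrivals as a percentage of the state population and the hot or cold label of each month, can be sketched as follows. The arrival counts and populations are invented round numbers, and the hot season is taken as October to March, which is the range given where the final plots are described; treat that cut-off as an assumption.

```r
# Invented monthly totals per state.
monthly <- data.frame(
  State        = c("Santa Catarina", "Santa Catarina", "Amazonas"),
  Month.Number = c(1, 7, 7),
  Arrivals     = c(60000, 15000, 4000),
  stringsAsFactors = FALSE
)

# Invented round populations; real values would come from the census.
population <- c("Santa Catarina" = 6000000, "Amazonas" = 4000000)

# Arrivals as a percentage of the population of the entry state.
monthly$Pct.Population <- unname(100 * monthly$Arrivals / population[monthly$State])

# Hot months: October to March (assumed cut-off); everything else is cold.
hot_months <- c(10:12, 1:3)
monthly$Season <- ifelse(monthly$Month.Number %in% hot_months, "hot", "cold")
```

In a plot, `Pct.Population` would drive the point size and `Season` the red or blue color, drawn with a low alpha so that overlapping points blend.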
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Goias&zoom=4&size=640x640&scale=2&maptype=roadmap&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Goias&sensor=false
The access method, state and country of origin played an important role in determining the feature of interest. Another relationship observed was whether a kind of visitor is seasonal or distributed along the year.
An interesting interaction was the peak in South American visitors, who enter Brazil only through the southern states.
A model to predict seasonal visitors and how internationalized each state is was created. It has some limitations because of the population source, which is collected only every 10 years and so is not very accurate. Also, visitors do not necessarily stay in the state where they enter the country, so this is a factor one should always keep in mind.
This plot is composed of two graphs. It shows all arrivals from Europe, which was chosen because, although South American visitors are more numerous, they are culturally very similar to Brazilians. So, Europe is taken as the foreign continent with the most visitors.
The two graphs relate European countries to the Brazilian state their visitors enter, for two access methods: by air and by land. Color is used to show both how many visitors are coming, so we can compare which country sends the most visitors, and how this has changed over the last 10 years. This is done by looking at the colored regions in each circle: the center means the year 2015, the outer ring means 2005. For example, a circle with a red center and a blue outer ring means that in 2015 there was a large number of visitors and in 2005 a small number.
Looking at the graph of air arrivals, we notice that Sao Paulo and Rio de Janeiro are the states with the most visitors, and Germany, France, Italy, Portugal and Spain the countries that sent the most visitors to Brazil. This is found by looking at which row or column has the most red or purplish color. Looking at the coloration of the rings, we notice a small shift over the years from Sao Paulo to Rio de Janeiro for all those countries, most noticeable for France and Italy. Also, using this same ring color, we notice a decrease in Portuguese visitors to the northeastern states, and that the only state that seems to have had an increase in visitors in recent years is Rio de Janeiro.
The graph of land entries by Europeans seemed odd to me at first but made sense after a few Google searches. We notice a massive flow of visitors entering Brazil by land in the state of Paraná, which can be explained by people doing road trips, mostly hitchhiking, through all of South America (those Europeans have no idea of the danger, really…); a famous route enters Brazil through Paraná because of Iguazu Falls, the largest waterfall system in the world.
This plot is interesting because it might indicate how different the south is in terms of integration with the rest of South America. The vast majority of entries happen in the three southernmost states, and this has important social and economic effects in those regions.
It is also possible to notice that the northern states of Roraima and Acre started receiving visitors recently. Google could not explain why this happened. Looking at our data, there are no records prior to 2014, which suggests the government probably used to put them in the “Other Federation States” category (yes, that level of chaos is called the Brazilian government).
Those two graphs are a model to represent which states receive the most international visitors as a percentage of the state population. This is shown by the size of the dots. The other thing shown is which states are more subject to seasonality, which was measured by dividing the months of the year into hot ones, from October to March, and cold ones, the rest. The dots are also drawn with alpha, which means they are a bit transparent, and since many will be printed on top of each other, a predominantly red color means most visits happened in the hot months, blue means they happened in the cold months, and purple means there is no seasonality and visits are spread over the year.
Some surprises happened here because I expected a bigger seasonality for Rio de Janeiro and not such a big one for Mato Grosso do Sul and Rio Grande do Sul.
Another weird surprise happened in Amazonas, which seems to receive more visitors in the cold months, by a very slight difference.
A key point of this whole work was finding out how the lack of a decent road system in Brazil affects the whole South American continent. The huge difference between visitors in the southern and northern states is easily explained after a few Google searches on the quality of roads in the northern states. This could be one of the possible answers to why so many visitors from Bolivia, Venezuela and Colombia come all the way south to Mato Grosso do Sul to enter Brazil by land. Another option involves illegal recreational chemical substances and the largest city of Brazil being Sao Paulo (which places Mato Grosso in the middle), but let’s presume the innocence of our esteemed anonymous visitors, who cannot even defend themselves with a nice excuse in this report, right?
Learning how to work with Google Maps was a bit difficult at first, and to be honest some aspects of it are still a mystery: building routes, getting time estimates, and whether any integration with Street View is possible, for example. But overall it was a rewarding experience.